
Fix background database account refresh stopping in multi-writer accounts#48758

Draft
jeet1995 wants to merge 3 commits into Azure:main from jeet1995:fix/background-refresh-multi-writer

Conversation

jeet1995 (Member) commented Apr 10, 2026

Problem

The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting topology changes (e.g., multi-write to single-write transitions).

Root Cause

In refreshLocationPrivateAsync(), when LocationCache.shouldRefreshEndpoints() returns false, the timer is never restarted:

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    this.isRefreshing.set(false);
    return Mono.empty(); // timer dies here
}

For multi-writer accounts, shouldRefreshEndpoints() returns false when the preferred write endpoint matches the current primary -- a steady-state condition. Once that happens, no further background refreshes occur for the lifetime of the client. The bug has existed since PR #6139 (Nov 2019; see point #4 in that PR's description).
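The failure mode is easy to reproduce in miniature. Below is a toy, deterministic model (a hypothetical ToyRefreshLoop class, not the SDK's actual types) of a self-rescheduling refresh loop: any branch that returns without re-arming the timer kills the loop permanently, so a later topology change is never observed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a self-rescheduling background refresh loop.
// Each fired timer tick must re-arm the next tick, or the loop dies.
class ToyRefreshLoop {
    final AtomicInteger refreshCount = new AtomicInteger();
    final AtomicBoolean needsRefresh = new AtomicBoolean(true);
    boolean timerArmed = true;           // is a next tick scheduled?
    final boolean rearmOnSteadyState;    // the behavior this PR adds

    ToyRefreshLoop(boolean rearmOnSteadyState) {
        this.rearmOnSteadyState = rearmOnSteadyState;
    }

    void tick() {
        if (!timerArmed) {
            return;                      // no timer pending: nothing ever runs again
        }
        timerArmed = false;              // the pending timer just fired
        if (needsRefresh.getAndSet(false)) {
            refreshCount.incrementAndGet();
            timerArmed = true;           // refresh path always reschedules
        } else if (rearmOnSteadyState) {
            timerArmed = true;           // the fix: keep the loop alive in steady state
        }                                // buggy path: falls through without rescheduling
    }

    void topologyChanged() {             // e.g. a multi-write -> single-write transition
        needsRefresh.set(true);
    }
}
```

Driving ticks by hand: with `rearmOnSteadyState = false` the loop performs one refresh, hits steady state, and never runs again even after `topologyChanged()`; with `true`, the change is picked up on the next tick.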

Behavioral Difference with .NET SDK

The .NET SDK handles this correctly in StartLocationBackgroundRefreshLoop() -- it only terminates when canRefreshInBackground is explicitly false, continuing even when ShouldRefreshEndpoints() returns false.

Fix

Add startRefreshLocationTimerAsync() to the else branch of refreshLocationPrivateAsync():

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    if (!this.refreshInBackground.get()) {
        // re-arm the timer so the background refresh loop stays alive
        this.startRefreshLocationTimerAsync();
    }
    this.isRefreshing.set(false);
    return Mono.empty();
}

Unit Tests

6/6 pass:

  • backgroundRefreshForMultiMaster: Updated assertion -- timer must keep running
  • backgroundRefreshDetectsTopologyChangeForMultiMaster: New -- simulates MW-to-SW transition via mock

Live DR Drill Validation (4 Scenarios)

Date: 2026-04-10 22:10Z -- 2026-04-11 00:32Z | Branch: fix/background-refresh-multi-writer @ 2048abeca

All scenarios used Direct + Gateway modes simultaneously. Kusto data from BackendEndRequest5M (Direct) and Request5M (Gateway).

Accounts

| Account | Type | Regions |
| --- | --- | --- |
| bgrefresh-mw-test-440 | Multi-writer | East US (hub) + West US |
| bgrefresh-sw-test-440 | Single-writer | East US (write) + West US (read) |

Scenario 1: MW -- Offline Secondary Region

Global endpoint, preferred = West US. Offline West US, observe failover to East US.


PASS -- Failover to East US in ~4 min. 32 GEM refreshes. West US traffic resumed after restore.

Scenario 2: MW -- MW-to-SW-to-MW Transition (Core PR validation)

Regional endpoint (westus.documents.azure.com), no preferred region. Disable then re-enable multi-write.

PASS -- Both transitions detected. MW-to-SW in ~3.5 min (writes shifted to EUS). SW-to-MW in ~1 min (writes returned to WUS). 28 GEM refreshes.

Scenario 3: SW -- Switch Write Region

Global endpoint, preferred = East US. Switch write EUS-to-WUS.


PASS -- Writes on WUS within 1 Kusto bucket. 20 GEM refreshes.

Scenario 4: SW -- Offline Write Region

Global endpoint, preferred = East US. Offline East US.


PASS -- Full failover to WUS in ~3 min. 32 GEM refreshes.

Backend Success Rates

Direct mode (BackendEndRequest5M)

| Scenario | Workload | Total | Success | Rate |
| --- | --- | --- | --- | --- |
| S1 MW Offline | dr-off-direct-write | 262,576 | 262,548 | 99.989% |
| S1 MW Offline | dr-off-direct-read | 292,092 | 290,830 | 99.568% |
| S2 MW Trans. | dr-mwsw-direct-write | 175,272 | -- | -- |
| S2 MW Trans. | dr-mwsw-direct-read | 131,586 | -- | -- |
| S3 SW Switch | dr-direct-write | 142,567 | 142,499 | 99.952% |
| S3 SW Switch | dr-direct-read | 251,072 | 247,366 | 98.524% |
| S4 SW Offline | dr-off-direct-write | 197,669 | 197,633 | 99.982% |
| S4 SW Offline | dr-off-direct-read | 232,202 | 226,599 | 97.587% |

Gateway mode (Request5M)

| Scenario | Workload | Total | Success | Rate |
| --- | --- | --- | --- | --- |
| S1 MW Offline | dr-off-gw-write | 469,579 | 469,518 | 99.987% |
| S1 MW Offline | dr-off-gw-read | 557,311 | 557,307 | 99.999% |
| S2 MW Trans. | dr-mwsw-gw-write | 147,864 | 146,494 | 99.073% |
| S2 MW Trans. | dr-mwsw-gw-read | 196,383 | 196,383 | 100.0% |
| S3 SW Switch | dr-gw-write | 133,214 | 133,146 | 99.949% |
| S3 SW Switch | dr-gw-read | 231,657 | 231,657 | 100.0% |
| S4 SW Offline | dr-off-gw-write | (included in S1 totals) | -- | -- |
| S4 SW Offline | dr-off-gw-read | (included in S1 totals) | -- | -- |

All errors (403/3 write-to-read-only, 404/1002 session-not-available) were auto-retried by the SDK -- zero user-visible failures.

Verdict

| Scenario | Failover | GEM Refreshes | Direct Write % | GW Write % | Verdict |
| --- | --- | --- | --- | --- | --- |
| MW Offline Secondary | ~4 min to EUS | 32 | 99.989% | 99.987% | PASS |
| MW-to-SW-to-MW | ~3.5 min / ~1 min | 28 | -- | 99.073% | PASS |
| SW Switch Write | < 1 bucket | 20 | 99.952% | 99.949% | PASS |
| SW Offline Write | ~3 min to WUS | 32 | 99.982% | 99.987% | PASS |
Kusto Queries Used
// Direct mode ops (BackendEndRequest5M)
BackendEndRequest5M
| where TIMESTAMP between (datetime({start}) .. datetime({end}))
| where GlobalDatabaseAccountName == '{account}'
| where UserAgent has 'dr-'
| where ResourceType == 2
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, Workload

// Gateway mode ops (Request5M -- lowercase columns)
Request5M
| where TIMESTAMP between (datetime({start}) .. datetime({end}))
| where globalDatabaseAccountName == '{account}'
| where userAgent has 'gw'
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, userAgent)
| summarize Total=sum(SampleCount), Success=sumif(SampleCount, statusCode < 400) by Workload
| extend SuccessRate=round(100.0 * Success / Total, 3)

// Write region transitions (MgmtDatabaseAccountTrace)
MgmtDatabaseAccountTrace
| where TIMESTAMP between (datetime({start}) .. datetime({end}))
| where GlobalDatabaseAccount == '{account}'
| project TIMESTAMP, Location, LocationType, FederationId, Status
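For reference, the success-rate figures in the tables follow the query's rounding convention (round(100.0 * Success / Total, 3)). A minimal Java sketch of the same aggregation (hypothetical Sample record and SuccessRateCalc class; the numbers in the usage note are the S2 gateway-write row):

```java
import java.util.List;

// Mirrors the Kusto aggregation used above:
//   Total       = sum(SampleCount)
//   Success     = sumif(SampleCount, statusCode < 400)
//   SuccessRate = round(100.0 * Success / Total, 3)
class SuccessRateCalc {
    record Sample(String workload, int statusCode, long sampleCount) {}

    static double successRate(List<Sample> samples) {
        long total = samples.stream().mapToLong(Sample::sampleCount).sum();
        long success = samples.stream()
                .filter(s -> s.statusCode() < 400)
                .mapToLong(Sample::sampleCount)
                .sum();
        // Round to 3 decimal places, matching Kusto's round(x, 3)
        return Math.round(100.0 * success / total * 1000.0) / 1000.0;
    }
}
```

With the S2 gateway-write counts (146,494 successes out of 147,864 requests) this yields 99.073, matching the table.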

Changes

  • 1 file changed, 10 insertions (GlobalEndpointManager.java)
  • 1 file changed, 50 insertions, 1 deletion (GlobalEndPointManagerTest.java)

…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background
refresh timer when shouldRefreshEndpoints() returns false. This means topology
changes (e.g., multi-write to single-write transitions) go undetected until
the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background
refresh loop unconditionally - the loop only stops when canRefreshInBackground
is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of
refreshLocationPrivateAsync(), ensuring the background timer always reschedules
itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition,
reads remain stuck on the primary region because the SDK never re-reads account
metadata to learn about the restored multi-write topology.

Unit tests updated:
- backgroundRefreshForMultiMaster: assertTrue (timer must keep running)
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test proving
  MW->SW transition detection via mock

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 force-pushed the fix/background-refresh-multi-writer branch from c95fb7b to 2048abe on April 10, 2026 at 20:51
jeet1995 and others added 2 commits April 10, 2026 20:57
…W switch, SW offline)

Kusto-backed evidence with charts for PR Azure#48758 validation.
Accounts: bgrefresh-mw-test-440 (multi-writer), bgrefresh-sw-test-440 (single-writer)
Branch: fix/background-refresh-multi-writer @ 2048abe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tions, SW switch, SW offline)"

This reverts commit c9fc5c4.
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).
